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Abstract 

We propose a flexible class of models based on scale mixture of uniform distri- 
butions to construct shrinkage priors for covariance matrix estimation. This new 
class of priors enjoys a number of advantages over the traditional scale mixture 
of normal priors, including its simplicity and flexibility in characterizing the prior 
density. We also exhibit a simple, easy to implement Gibbs sampler for posterior 
simulation which leads to efficient estimation in high dimensional problems. We 
first discuss the theory and computational details of this new approach and then 
extend the basic model to a new class of multivariate conditional autoregressive 
models for analyzing multivariate areal data. The proposed spatial model flexi- 
bly characterizes both the spatial and the outcome correlation structures at an 
appealing computational cost. Examples consisting of both synthetic and real- 
world data show the utility of this new framework in terms of robust estimation 
as well as improved predictive performance. 

Key words: Areal data; Covariance matrix; Data augmentation Gibbs sampler; Multivari- 
ate conditional autoregressive model; Scale mixture of uniform; Shrinkage; Sparsity. 

1 Introduction 

Estimation of the covariance matrix S of a multivariate random vector y is ubiquitous in 
modern statistics and is particularly challenging when the dimension of the covariance 
matrix, p, is comparable or even larger than the sample size n. For efficient inference, 
it is thus paramount to take advantage of parsimonious structure often inherent in 
these high dimensional problems. Many Bayesian approaches have been proposed for 
covariance matrix estimation by placing shrinkage priors on various parameterizations 
of the covariance matrix E. Yang & Berger (1994) proposed reference priors for S based 
on the spectral decomposition of E. Barnard et al. (2000) and Liechty et al. (2004) 
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considered shrinkage priors in terms of the correlation matrix and standard deviations. 
Daniels & Kass (1999, 2001) proposed flexible hierarchical priors based on a number of 
parameterizations of S. All of these methods use non-conjugate priors and typically rely 
on Markov chain algorithms which explore the state space locally such as Metropolis- 
Hastings methods or asymptotic approximations for posterior simulation and modeling 
fitting and are restricted to low-dimensional problems. 

A large class of sparsity modeling of the covariance matrix involves the identification 
of zeros of the inverse Q = This corresponds to the Gaussian graphical models in 
which zeros in the inverse covariance matrix uniquely determine an undirected graph 
that represents the strict conditional independencies. The Gaussian graphical model 
approach for covariance matrix estimation is attractive and has gained substantive 
attention owing to the fact that its implied conditional dependence structure provides a 
natural platform for modeling dependence of random quantities in areas such as biology, 
finance, environmental health and social sciences. The standard Bayesian approach to 
inference in Gaussian graphical models is the conjugate C-Wishart prior (Roverato, 
2002; Atay-Kayis & Massam, 2005), which places positive probability mass at zero on 
zero elements of Q. A zero constrained random matrix Q has the G-Wishart distribution 
Wg(&, D) if its density is 

pin\G) = CGib,D)''\n\^'~'^/\xp{-^tiiDn)}l{neMHG)}, (1) 

where 6 > 2 is the degree of freedom parameter, D is a symmetric positive definite 
matrix, Cg(6, D) is the normalizing constant, M+(G) is the cone of symmetric positive 
definite matrices with entries corresponding to the missing edges of G constrained to 
be equal to zero, and 1{.} is the indicator function. Although G-Wishart prior has 
been quite successfully used in many applications, it has a few important limitations. 
First, the G-Wishart prior is sometimes not very flexible because of its restrictive 
form. For example, the parameters for the degrees of freedom are the same for all 
the elements of Q. Second, unrestricted graphical model determination and covariance 
matrix estimation is computationally challenging. Recent advances for unrestricted 
graphical models (Jones et al., 2005; Wang & Carvalho, 2010; Mitsakakis et al., 2010; 
Dobra et al., 2011) all rely on the theoretical framework of Atay-Kayis & Massam 
(2005) for sparse matrix completion which is very computationally intensive. Indeed, 
for non-decomposable graphical models, we do not have a closed form expression for the 
normalizing constant CG{b,D) and thus have to resort to tedious and often unstable 
Monte Carlo integration to estimate it for both graphical model determination and 
covariance matrix estimation. 

An alternative method for Bayesian graphical model determination and estimation is 
proposed by Wong et al. (2003). They placed point mass priors at zero on zero elements 
of the partial correlation matrix and constant priors for the non-zero elements. Their 
methodology applies to both decomposable and non-decomposable models and is fitted 
by a reversible jump Metropolis-Hastings algorithm. However, it is unclear how to 
incorporate prior information about individual entries of S in their framework as the 
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mathematical convenience of constant priors is essential for their algorithm. 

Absolutely continuous priors, or equivalently, penalty functions, can also induce 
shrinkage to zero of subsets of elements of fl and represent an important and flexible 
alternative to the point mass priors. In the classical formulation, there is a rich liter- 
ature on methods for developing shrinkage estimators via different penalty functions 
including the graphical lasso models (Yuan & Lin, 2007; Friedman et al., 2008; Roth- 
man et al., 2008) and the graphical adaptive lasso models (Fan et al., 2009) among 
many others. The recent literature on Bayesian methods has focused on the posterior 
mode estimation, with little attention on the key problem of efficient inference on co- 
variance matrix based on full posterior computation, with the only exception of Wang 
(2011) which gave a fully Bayesian treatment of the graphical lasso models. One likely 
reason is the difficulty in efficiently generating posterior samples of covariance matrices 
under shrinkage priors. A fully Bayesian inference is quite desirable because it not 
only produces valid standard errors and Bayes estimators based on decision-theoretic 
framework but, perhaps more importantly, can be applied in multiple classes of mul- 
tivariate models that involve key components of unknown covariance matrices such as 
the multivariate conditional autoregressive models developed in Section 5. 

This paper proposes a class of priors and the implied Bayesian hierarchical modeling 
and computation for shrinkage estimation of covariance matrices. A key but well known 
observation is that any symmetric, unimodal density may be written as a scale mixture 
of uniform distributions. Our main strategy is to use this mixture representations 
to construct shrinkage priors compared to the traditional methods for constructing 
shrinkage priors using the scale mixture of normal distributions. As mentioned above, 
the scale mixture of uniform distribution is not new to Bayesian inference. Early usage 
of this representation includes Bayesian robust and sensitive analysis (Berger, 1985; 
Berger & Berliner, 1986) and robust regressions with heavy-tailed errors Walker et al. 
(1997). 

However, our motivations are different; we seek an approach for constructing tractable 
shrinkage priors that are both flexible and computationally efficient. We argue that the 
class of scale mixture of uniform priors provide an appealing framework for modeling a 
wide class of shrinkage estimation problems and also has the potential to be extended 
to a large class of high dimensional problems involving multivariate dependencies. We 
also highlight that a salient feature of our approach is its computational simplicity. 
We construct a simple, easy to implement Gibbs sampler based on data augmentation 
for obtaining posterior draws for a large class of shrinkage priors. To the best of our 
knowledge, none of the existing Bayesian algorithms for sparse permutation invariant 
covariance estimation can be carried out solely based on a Gibbs sampler and they 
have to rely on Metropolis-Hastings methods. Since Gibbs samplers involve global pro- 
posal moves as compared to the local proposals of Metropolis-Hastings methods, in 
high dimensions this makes a difference in both the efficiency of the sampler and the 
running time of the algorithm. Through simulation experiments, we illustrate the ro- 
bust performance of the scale mixture of uniform priors for covariance matrix, as well 
as highlighting the strength and weakness of this approach compared to those based on 
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point mass priors. Through an extension to a class of multivariate conditional autore- 
gressive models, we further illustrate that the framework of scale mixture of uniforms 
naturally allows and encourages the integration of data and expert knowledge in model 
fitting and assessment, and consequently improves the prediction. 

The rest of the paper is organized as follows. In Section 2 we outline our framework 
for constructing shrinkage priors for covariance matrices using the scale mixture of uni- 
forms. In Section 3 we construct a Gibbs sampler based on a data augmentation scheme 
for sampling from the posterior distribution. In Section 4 we conduct a simulation study 
and compare and contrast our models with existing methods. In Section 5 we extend 
our model to build shrinkage priors on multivariate conditional autoregressive models. 
In Section 6 we briefly discuss the application of our methods for shrinkage estimation 
for regression models. 



Let y = {y^^\ y^'^\ . . . , y^^^Y t>e a p-dimensional random vector having a multivari- 
ate normal distribution N(0, S) with mean zero and covariance matrix S. Let Q = 
{uJij)pxp = denote the precision matrix, i.e., the inverse of the covariance matrix 
E. Given a set of independent random samples Y = {yi, . . . ,yn)pxn of y, we wish to 
estimate the matrix Q. 

We consider the following prior distribution for the precision matrix: 



where gij{-) is a continuous, unimodal and symmetric probability density function with 
mode zero on M, M+ is the space of real valued symmetric, positive definite p x p 
matrices, > is a scale parameter controlling the strength of the shrinkage and 1a 
denotes the indicator function of the set A. Our primary motivation for constructing 
prior distributions of the form (2) is that, often in real applications the amount of prior 
information the modeler can vary across individual elements of Q. For instance, one 
might incorporate the information that the variance of certain entries of Q are close to 
0, or constrain some entries to be exactly 0. In this setting, shrinking different elements 
of n at a different rate clearly provides a flexible framework for conducting Bayesian 
inference. In addition to obtaining a flexible class of prior distributions, by using a 
mixture representation for the density Qij, we can construct a simple, efficient and easy 
to implement Gibbs sampler to draw from the posterior distribution of the precision 
matrix. 



2 Shrinkage priors for precision matrices 



2.1 Precision matrix modeling 




(2) 
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2.2 Scale mixture of uniform distributions 

Our main tool is the following theorem which says that all unimodal, symmetric den- 
sities may be expressed as scale mixture of uniform distributions. 

Theorem 1. Walker et al. (1997); Feller (1971) Suppose that 6 is a real-valued random 
quantity with a continuous, unimodal and symmetric distribution with mode zero having 
density tt{6) (— oo < 9 < oo). Suppose it' (6) exists for all 9. Then tt{9) has the form: 

r°° 1 

^(0) = I -lm<t}h{t)dt, (3) 

where h{t) oc —2t x n'it) is some density function on [0, oo). Therefore we may write 

n{9 \t) r^\J{-t,t), h{t) oc -2tx n'{t). 

The generality and simplicity of Theorem 1 allow us to characterize various shrinkage 
priors by using the special structure of (3). Indeed, as noted in Walker et al. (1997), a 
Gaussian random variable x ~ N(/i, a^) can be expressed as x | f ~ U(/i — ay/v,fi + 
ay/v),v ~ Ga(3/2, 1/2), which shows that, all of the distributions which may be written 
as a scale mixture of Gaussian distributions can indeed be expressed as a scale mixture of 
uniform distributions as well. Let us discuss a few more examples of popular shrinkage 
priors where Theorem 1 is applicable. 

A popular class of distributions for constructing shrinkage priors is the exponential 
power family given by n{9) oc exp(— |^|'^/r''), where the exponent g > controls the 
decay at the tails. The mixing density function h{t) given in (3) can be thought of as 
the "scale" parameter. In this case we have h{t) oc t"^ exp(— t^'/r''), which corresponds 
to the generalized gamma distribution. Two important special cases are the Gaussian 
distribution (g = 2), and the double-exponential distribution (g = 1), which have been 
studied extensively in the context of the Bayesian lasso regression (Park & Casella, 2008; 
Hans, 2009) and the Bayesian graphical lasso (Wang, 2011). For general g > 0, one may 
write the exponential power distribution as a scale mixture of Gaussian distributions 
(Andrews & Mallows, 1974; West, 1987). However, a fully Bayesian, computationally 
efficient analysis is not available based on Gaussian mixtures, especially in the context 
of covariance estimation and graphical models. A few approximate methods exist for 
doing inference using the exponential power prior distribution such as the variational 
method proposed by Armagan (2009). Our use of uniform mixture representation has 
the advantage of posterior simulation via an efficient Gibbs sampler for any g > as is 
shown in Section 2.3 and further exemplified in Sections 4 and 6. 

Another natural candidate for shrinkage priors is the Student-t distribution given by 
7r{9) oc (l + ^Vr2)-('^+i)/2, for which it is easy to show that h{t) oc t\l + tyT^)-^''+^^/^. 
Hence, t^/r^ is an inverted beta distribution IB(3/2, z//2). Recall that the inverted beta 
distribution IB(a, b) has the density given by p{x) oc x"^^(l + x)^"'^^lx>o- 

The generalized double Pareto distribution is given by 71(6') oc (1 + \9\/t)^^^^°'\ 
which corresponds to h{t) oc ^(l + t/r)"*^^^"''; i.e., the scale t/r follows an inverted beta 
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distribution IB(2,a). Armagan et al. (2011) investigated the properties of this class of 
shrinkage priors. 

The above discussed class of shrinkage priors are well known and documented. In 
the following we give a new distribution which we call the "logarithmic" shrinkage prior 
which seems to be new in the context of shrinkage priors. Consider the density given 
by 

7r(0) oc log(l + T^/e^) . (4) 

It is easy to show that the corresponding mixing distribution has the half-Cauchy 
density, 

/i(t)(x(l + tVr2)-il|,>o}. 

This prior has two desirable properties for shrinkage estimation: an infinite spike at 
zero and heavy tails. These are precisely the desirable characteristics of a shrinkage 
prior distribution as argued convincingly for the "horseshoe" prior in Carvalho et al. 
(2010). The horseshoe prior is constructed by scale mixture of normals, namely, 9 ~ 
N(0,cr^),cr ~ C+(0, 1), where C+(0, 1) is a standard half-Cauchy distribution on the 
positive reals with scale 1. The horseshoe prior does not have a closed form density but 
satisfies the following: 

|log(l + A/e^) < n{e) < K log(l + 2/^2), 

for a constant > 0. Clearly, our new prior (4) has identical behavior at the original 
and the tails as that of the horseshoe prior distribution with the added advantage of 
having an explicit density function unlike the horseshoe prior. 

2.3 Posterior sampling 

Let y denote the observed data. The scale mixture of uniform representation provides 
a simple way of sampling from the posterior distribution p[9 \ y) oc f{y \ 6)n{6), where 
f{y I 6) is the likelihood function and 7i{6) is the shrinkage prior density. The repre- 
sentation (3) leads to the following full conditional distributions of 6 and t (conditional 
on y) given by 

P(6' I 2/,t) oc /(y I 6*) l|0|<t, p{t \y,e) (X -n'{t)lie\<t ■ (5) 

Thus the data augmented Gibbs sampler for obtaining posterior draws from p{6, t \ y) 
involves iteratively simulating from the above two conditional distributions. Simulation 
of the former involves sampling from a truncated distribution, which is often achieved by 
breaking it down further into several Gibbs steps, while sampling the latter is achieved 
by the following theorem. 
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Table 1: Density of 6 and t for some common shrinkage prior distributions, along 
with the conditional posterior inverse cumulative probability function for sampling t. 

Densities are given up to normalizing constants. 
Density name Density for 9 Density for t Inverse CDF: F^^{u \ 9) 

Exponential power exp(-|6'| 9/^') f expi-fi /t'') {-T«(fog u) + |6l|9}i/« 

Student-^ (l+^^Vr2)-(''+l)/2 t^^^, ^ ^2/^2)-(.+3)/2 |y-2/(.+ l)(^2 ^ ^2) _ ^2|l/2 

Generalized double Pareto (1 + |6'|/r)-(i+") t(l + i/T)-(2+a) ^ _ ^ 

Logarithmic log(l + tV6'2) {l + f/T^)-^ t{(1 + r2/6'2)" - l}-i/2 

CDF, cumulative distribution function. 



Theorem 2. Suppose the shrinkage prior density 7i{6) can be represented by a scale 
mixture of uniform as in equation (3). Then the (posterior) conditional probability 
density function of the latent scale parameter t is given by 

pit \y,e) (X -n'{t) lt>|0|, 

and the corresponding (conditional) cumulative distribution function is 

u = Fit\y,e)=pr{T<t\y,9) = ""^^^ 1^1 < t . (6) 

7i{\e\) 

The advantage of the above theorem is that it gives an explicit expression of the 
conditional cumulative distribution function in terms of the prior density '/r(-). This 
provides a simple way to sample fromp(t | y, 9) using the inverse cumulative distribution 
function method whenever 7r(-) can be easily inverted. Table 1 summarizes the density 
functions of 7r(5) and h{t), and the inverse conditional cumulative distribution function 
F~^{u I y^9) for several shrinkage priors introduced in Section 2.2. We note that the 
scale mixture of uniform distributions are already used for doing inference for regression 
models using the Gibbs sampler outlined above, for instance see Qin et al. (1998). 

3 Posterior computation for precision matrices 

3.1 Gibbs sampling on given global shrinkage parameter r 

Recall that given a set of independent random samples Y = {yi, ■ ■ ■ ,yn)pxn from a 
multivariate normal distribution N(0, we wish to estimate the matrix Q using the 
prior distribution given by (2). Let T = {tij}i>j be the vector of latent scale parameters. 
For simplicity we first consider a simple case where gij{-) = g{-),mij = and Tij = r in 
this section, and then discuss the strategies for choosing r in Section 3.2. However our 
algorithms can be easily extended to the general case of unequal shrinkage parameters 
Tij. Theorem 1 suggests that the prior (2) can be represented as follows: 

p{Q \t)= I p{Q,T I r)dr (X / n [^M\^.,\<rU,}h{t.j)]dT, 
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where p(fi,T | r) oc Yl-^j [l/{2tij) l{\^^^\^ru,}h{tij)] is the joint prior and h{tij) oc 
—tijg'{tij). The joint posterior distribution of (fi,T) is then: 

pin,T\Y,T) cx |fir/2exp{-itr(51])}nhl{K-l<-*.}^?'(^«.-)], (7) 

i>j 

where S = YY^. 

The most direct approach for samphng from (7) is to update each Uij one at a time 
given the data, T, and all of the entries in Q except for Uij in a way similar to those 
proposed in Wong et al. (2003). However, this direct approach requires a separate 
Cholesky factorization for updating each Uij to find its allowable range and conditional 
distribution. It also relies on the Metropolis-Hastings step to correct the sample. We 
describe an efficient Gibbs sampler for sampling {Q, T) from (7) that involves one step 
for sampling Q and the other step for sampling T. 

Given T, the first step of our Gibbs sampler systematically scans the set of 2 x 2 
sub-matrices {fle,e '■ e = < j < i < p} to generate fl. For any e = let 

= {1, . . . ,p} be the set of vertices and note that 

\fl\ = \A\\Qy\ey\e\, 

where A, the Schur component of Qv\e,v\e, is defined hj A = Qe,e — B with B = 
Qe,v\e{^v\e,v\e)~^^v\e,e- The full Conditional density of fle,e from (7) is given by 

p(fie,e I -) CX |Ar/2exp{-^Se,eA}l|n,,,er}, 

where T = {\uJij\ < rtij} fl {\ujii\ < rta} fl {\ujjj\ < rtjj}. Thus, A is a truncated 
Wishart variate. To sample A, we write 

\ hi I J \ d2 J \ I J ' V S22 y 

with di > and d2 > 0. The joint distribution of (/12, (ii, '^2) is then given by: 

p{di, d2, hi I -) oc d'^^^^^d^^^ exp[-^tr{sii(ii + S22(^2i'^i + (^2) + '2s2idihi}] ln,,,eT, 

which implies that the univariate conditional distribution for the parameters di and d2 
is a truncated gamma distribution, and a truncated normal distribution for hi- Details 
of the parameters of the truncated region and strategies for sampling are given in the 
Appendix. Given Q, the second step of our Gibbs sampler generates T in block using 
the inverse cumulative distribution function methods described in equation (6). These 
two steps complete a Gibbs sampler for model fitting under a broad class of shrinkage 
priors for Q. 

One attractive feature of the above sampler is that it is also suitable for sampling 
Q G M~^{G), that is, Q is constrained by an undirected graph G = {V,E) where V is 
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the set of vertices and E is a. set of edges and Uij = if and only if ^ E. The 
abihty to sample fl G M~^{G) is useful when substantive prior information indicates a 
certain subset of elements in Q are indeed zero. Section 5 provides such an example 
that involves a class of multivariate spatial models. To sample Q G M~^{G), the only 
modification that is required is to replace the set of all 2 x 2 sub-matrices {fle,e '■ e = 
1 < j < < p} with the set {fle,e '■ e G E} U {Qy : v G Vi} where Vj is the set of 
isolated nodes in G. 

3.2 Choosing the shrinkage parameters 
We start with the scenario when Xjj = r and rriij = for all i > j. In this case we have 

where Gr is a normalizing term involving r. This normalizing constant is a necessary 
quantity for choosing hyper parameters for r. Since p{Q | r) is a scale family, applying 
the substitution Q = Q/r yields, 

where the integral on the right hand side of the above equation does not involve r 
because {fi : i7 G M+} = {fi : f2 G M'^}. Hence, under a hyperprior p{t), the 
conditional posterior distribution of r is 

Now the sampling scheme in Section 3.1 can be extended to include a component to 
sample r at each iteration. 

Now suppose rriij = and instead of having a single global shrinkage parameter, we 
wish to control the rate at which the individuals elements of Q are shrunk towards 
separately. A natural shrinkage prior for this problem is 

pin\T) = G;'llg,,i^) 

where Qij may all be different. The idea is that by choosing a different density Qij 
for each edge, we can incorporate the prior information for the rate at which different 
entries of Q are shrunk towards 0. For a hyper prior p{t), using an identical calculation 
as in (8) and (9) we deduce that the conditional posterior of r is then given by 

pir I Y,n) oc r-^(^'+i)/2 J]^,^.(^)p(r) . (10) 
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Notice that the Gibbs sampler presented in Section 3.1 apphes to this case as well; we 
just need to use the cumulative distribution function for the density gij for sampling 
from the conditional distribution of tij. Alternatively, one can also fix a density g 
and write p{Q \ r) = C:^:^ Y[i>j di^^) fixed positive constants Vij and then make 
inference about the common r. 

We conclude this section with the remark that our approach can be adapted for 
hierarchical models. For example, in Section 5 we consider a shrinkage prior that 
shrinks Q towards a given matrix M = {rriij) under the constraint that Q G M~^{G) for 
a given graph G: 

where E denotes the set of edges of the graph G and normalizing constant Cr^M = 
IneM+(G) Y\.{ij)eE g^ '^'^'J^'^ )dQ is the normalizing constant. In this case Cr,M is analyt- 
ically intractable as a function of r. In the example of Section 5, we fixed r at a value 
that represents prior knowledge of the distribution of fl to avoid modeling r. In some 
applications, it may be desirable to add another level of hierarchy for modeling r so 
that we can estimate it from data. Several approaches have been proposed for dealing 
with the intractable normalizing constant, see Liechty et al. (2004), Liechty et al. (2009) 
and the references therein for one such approach. 

4 Simulation experiments 

To assess the utility of the scale mixture of uniform priors, we compared a range of 
priors in this family against three alternatives: the frequentist graphical lasso method 
of Friedman et al. (2008), the Bayesian G-Wishart prior and the method of Wong et al. 
(2003). The latter two place positive prior mass on zeros. We considered four covariance 
matrices from Rothman et al. (2008): 

Model 1. An AR(1) model with cxij =0-t-i\. 

Model 2. An A'R(4) model with Ua =1, cjj^j.i = Wj-i^j =0-2, = ^i-2,i = ^i,i-3 = 

^i-3,i =0-2, Ui^i-A = ^i-A,i =0-1. 

Model 3. A sparse model with Vt = B + 6Ip where each off-diagonal entry in B is gen- 
erated independently and assigned the value 0-5 with probability a =0-1 and otherwise. 
The diagonal elements Bu are set to be 0, and 6 is chosen so that the condition number of 
fl is p. Here the condition number is defined as max(A)/ min(A) where max(A), min(A) 
respectively denote the maximum and minimum eigenvalues of the matrix Vt. 

Model 4. A dense model with the same VL as in model 3 except for a =0-5. 

For each of the above four models, we generated samples of size n = 30, 100 and di- 
mension p = 30, yielding the proportion of non-zero elements to be 6%, 25%, 10%, 50%, 
respectively. We compute the risk under two standard loss functions. Stein's loss 
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function, Li(S,S) = tr(SS^^) — log(SS^^) — p, and the squared-error loss function 
L2(S,S) = tr(E — S)^. The corresponding Bayes estimators are S^^ = {E{Q \ Y)}^-'^ 
and T,L^ = E(S | Y), respectively. We used the posterior sample mean using the Gibbs 
sampler for estimating the risk for the Bayesian methods and the maximum likelihood 
estimate for the graphical lasso method. 

When fitting graphical lasso models, we used the 10-fold cross-validation to choose 
the shrinkage parameter. When fitting the G-Wishart priors, we followed the con- 
ventional prior specification Q ~ Wg(3,/p) and used the reversible jump algorithm of 
Dobra et al. (2011) for model fitting. For both the C-Wishart priors and the methods 
of Wong et al. (2003), we used the default graphical model space prior (Carvalho & 
Scott, 2009) 

p(G) = {(l+m)(|^l)}-\ 

where m = p{p — l)/2 and |G| is the total number of edges in graph G. For the 
scale mixtures of uniforms, we considered the exponential power prior p{Q \ r) oc 
exp{— J2i<j I'^ijl''/'^'^} with q G {0-2, 1}, the generalized double- Pareto prior p{Q \ r) oc 
]^^<^.(l-|-|c<;jj|/r)~^~" and the new logarithmic prior p(i7 | r) oc Y[i<:j^og{l+T'^/ufj). For 
the choice of the global shrinkage parameters, we assumed the conjugate distribution 
~ Ga(1, 0-1) for the exponential power prior; a = 1, 1/(1 + r) ~ U(0, 1) for the 
generalized double Pareto prior as suggested by Armagan et al. (2011); and r ~ C+(0, 1) 
for the logarithmic prior as was done for the horseshoe prior in Carvalho et al. (2010). 

Twenty datasets were generated for each case. The Bayesian procedures used 15000 
iterations with the first 5000 as burn-ins. In all cases, the convergence was rapid and 
the mixing was good; the autocorrelation of each elements in Q died out typically 
after 10 lags. As for the computational cost, the scale mixture of uniforms and the 
method of Wong et al. (2003) were significantly faster than the G-Wishart method. 
For example, for model 4, the G-Wishart took about 11 hours for one dataset under 
Matlab implementation on a six core 3-3 Ghz computer running CentOS 5-0 unix ; while 
the scale mixture of uniforms and the method of Wong et al. (2003) took only about 
20 and 6 minutes respectively. The graphical lasso method is just used as a benchmark 
for calibrating the Bayesian procedures. For each dataset, all Bayesian methods were 
compared to the graphical lasso method by computing the relative loss; for example, for 
the Li loss, we computed the relative loss as Li(S, S) — Li(Sglasso! where S is any 
Bayes estimator of S and Sqlasso is the graphical lasso estimator. Thus, a negative value 
indicates that the method performs better relative to the graphical lasso procedure and 
a smaller relative loss indicates a better relative performance of the method. 

Table 2 reports the simulation results. The two approaches based on point mass 
priors outperform the continuous shrinkage methods in sparser models such as model 
1, however, they are outperformed in less sparse configurations such as model 2 and 
4. One possible explanation is that the point mass priors tend to favor sparse models 
because it encourages sparsity through a positive prior mass at zero. Finally, the 
exponential power with q =0-2, the generalized double Pareto and the logarithmic 
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Table 2: Summary of the relative Li and L2 losses for different models and different 
methods. Medians are reported while standard errors are in parentheses. 
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(M) 


-3-2 
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-1-3 (0-7) 
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-2-5 (1-7) 


-0-4 (0-9) 
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-2-3 
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-0-6 (0-8) 
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(0-6) 
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-0-8 (0-9) 
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(0-2) 
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-1-0 (0-3) 
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Wg, G-Wisliart; WCK, Wong ct al. (2003); GDP, generalized double Pareto; EP, exponential power; Log: logarithmic. 



priors have very similar performances - ranking among top models in all cases. In 
summary, the experiment illustrates that these three heavy-tailed priors in the scale 
mixture of uniform family are generally indeed good for high dimensional covariance 
matrix estimation. 

5 Application to multivariate conditional autoregressive models 

5.1 Multivariate conditional autoregressive models based on scale mixture of uniform 

priors 

Multivariate conditional autoregressive models (Banerjee et al., 2004) constitute a di- 
verse set of powerful tools for modeling multivariate spatial random variables at areal 
unit level. Let W = {wij)p^xpr be the symmetric proximity matrix of Pr areal units, 
Wij G {0, 1}, and Wa are customarily set to 0. Then W defines an undirect graph 
Gr = (VryEr) whcrc an edge G Er if and only if Wij = 1. Let Wi+ = J^j'^ij^ 

Ew = diag(wi+, . . . , Wp,.+) and M = {rriij) = Ew — pW . Let X = (xi, . . . , Xp^Y denote 
a Pr X Pc random matrix where each is a p^- dimensional vector corresponding to re- 
gion i. Following Gelfand & Vounatsou (2003), one popular version of the multivariate 
conditional autoregressive models sets the joint distribution of X as 

vec(X) ~ iV{0, (17c J^,.)"H> p = Ew-pW, Sl^ ~ W(6c, ^c), (H) 

where Qr is the PrXPr column covariance matrix, Qc is the PcXPc row covariance matrix, 
p is the coefficient measuring spatial association and is constrained to be between 
the reciprocals of the minimum and maximum eigenvalues of W to ensure that fir 
is nonsingular, and be and Dc respectively denote the degree of freedom and location 
parameters of a Wishart prior distribution for Qc- The joint distribution in (11) implies 
the following conditional distribution: 

Xi I X_i,p,l]c ~ N( ^ pw~j^ Xj,w~lnc), 

jene{i) 
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where ne(i) denotes the neighbor of region i, that is, the set of points satisfying Wij = 1. 
Evidently, the two covariance structures {Qr,^c) are crucial in determining the effects 
of spatial smoothing. For the matrix Qc, direct application of shrinkage priors can 
reduce estimation uncertainties as compared to the conjugate Wishart prior in (11). 
For Qr, one common value of p for all Xi may limit the flexibility of the model because 
it assumes the same spatial association for all regions. The recent work of Dobra et al. 
(2011) uses the G- Wishart framework to provide alternative models. Specifically, the 
authors recommend the following extensions for modeling {Qr, ^c)'- 

flrl M ^WcAbr^M), M\p = Ew-pW, Wg,(6c,I?c), (12) 

where the row graph Gr is fixed and obtained from the proximity matrix W, and 
the column graph Gc is unknown. For both models in (11) and (12), a prior for p 
was chosen to give higher probability mass to values close to 1 to encourage sufficient 
spatial dependence. In particular, Dobra et al. (2011) put equal mass on the following 
31 values: {0, 0-05, 0-1, . . . , 0-8, 0-82, . . . , 0-90, 0-91, . . ., 0-99},. Notice that fir and 
Qc are not uniquely identified since, for any c > 0, Qc^^r = (cQc) ® (^r/c) (Wang & 
West, 2009). We address this by fixing Qr,n = 1- 

Using the theory and methods for covariance matrix developed in Section 3, we 
now extend the multivariate conditional autoregressive models (11) using the scale 
mixture of uniform distributions. We consider the following two extensions for modeling 
Vlr G M+{Gr) and G M+ 

rirl p = Ew- pW, p(fic I r) cx Jj5'c(c^c,ij/rc), (13) 

i>j 

and 

/m) 

{{i,j)(^Er}U{i=jGVr} i>j 

The first extension (13) places shrinkage priors on Qc while leaving the model for Qr 
unchanged. The second extension (14) further shrinks Qr towards the matrix M = 
Ew ~ while allowing adaptive spatial smoothing by not constraining Vt^. to be 
controlled by a common parameter p. 

One practical advantage of the our model (14) over the model (12) of Dobra et al. 
(2011) is its flexibility in incorporating prior knowledge. For example, the similarity of 
spatial neighbors implies that the off-diagonal elements of VLr should be constrained to 
be negative (Banerjee et al., 2004). This point is not addressed by Dobra et al. (2011) 
as their method is only applicable when the free elements of VL^. are not truncated. In 
the scale mixture of uniform framework, this important constraint is easily achieved by 
truncating each free off-diagonal element in Vtr to be negative when sampling fi^- The 
functional form of gr{-) and the shrinkage parameter can be pre-specified through 
careful prior elicitation as follows. Using the Gibbs sampler in Section 3.1, we are able 
to simulate from the prior distribution of Vt^. for fixed gr{-) and t^.. These prior draws 
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allow us to choose gr{-) and to represent plausible ranges of spatial associations. To 
specify these ranges, one guideline can be based on the model (11) for which Gelfand & 
Vounatsou (2003) recommended a prior for p that favors the upper range of p e (0, 1). 
In light of this recommendation, we prefer those Qr and r,, that increasingly favor values 
of Uc^ij close to 1 for any (z,j) G Er and Uc^a close to for i G Such choices 
of priors integrate prior information about spatial associations and allow for varying 
spatial smoothing parameters across different regions. 

5.2 US cancer data 

Using our model, we analyze the same real data example studied by Dobra et al. (2011) 
concerning the application of multivariate spatial models for studying the US cancer 
mortality rates. The data we analyzed consists of mortality counts for 10 types of 
tumors recorded for the 48 mainland states plus the District of Columbia for the year 
2000. The data were collected by the National Center for Health Statistics. Moral- 
ity counts below 25 were treated as missing because they are regarded as unreliable 
records in cancer surveillance community. Let Yij be the number of deaths in state 
i = 1, . . . = 49 for tumor type j = 1, ... ,pc = 10. Following Dobra et al. (2011), we 
considered Poisson multivariate loglinear models with spatial random effects: 

Yij I rjij ~ Poi(?7ij), log{r]ij) = log(gi) + fXj + Xij, 

where is the population of state i, pj is the intercept of tumor type j and Xij is the 
zero-mean spatial random effect associated with state i and tumor j and has the joint 
distribution vec(X) ~ N{0, (fi^ ® ^r)"^}. 

We compared the out-of-sample predictive performance of model (13) and (14) 
against the model (11) of Gelfand & Vounatsou (2003) and model (12) of Dobra et al. 
(2011). For (11) and (12), we used the same hyper-parameter settings as in Dobra 
et al. (2011). For (13), we set gd-) to be the logarithmic density in (4) and placed 
standard half-cauchy prior on in order to expect robust performance for shrinkage 
estimation of flc as was suggested by the simulation study in Section 4. For (14), we let 
gr{ujr,ij) oc exp{-| 

l{ajr,ij<o} foi' — j & Er SO that fir is centered 

around M = W — Ew and the similarity of spatial neighbors is ensured. We did not 
choose heavy-tailed distributions for gr{-) because the sample size Pc = 10 is relatively 
small for the dimension Pr = 49 and a heavy-tailed prior can lead to a posterior dis- 
tribution of uJr,ij to be unrealistically small and Ur^u to be unrealistically large. We 
considered Tr G {0-1, 1, 10} to assess the prior sensitivity. Finally, we modeled gc{-) as 
in model (13). 

In order to assess the out-of-sample predictive performance, we replicated the 10- 
fold cross-validation experiment of Dobra et al. (2011). Specifically, we divided the 
nonmissing counts of Y into 10 bins. For each bin i, we used the samples from the 
other 9 bins as observed data and imputed the samples from bin i as missing. To 
compare different models, we then computed the predictive mean squared error and 
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mean variance as follows 



MSE 



1 



\{{i,3):Y,,>2b}\ 



{(i,j):y,,>25} 



and 



VAR 



1 



: r,, > 25}| 



{(ij):y,,>25} 



where E(Yij) and Var(Yij) are estimated using the posterior sample mean and variance 
based on the output of the analysis of one of the 10 cross-validation datasets in which 
Yij are treated as missing. All results were obtained using a Monte Carlo sample of size 
80000 after an initial, discarded burn-in of 80000 iterations. 

Figure 1 shows the raw and predicted morality rate of colon cancer. Table 3 reports 
the predictive performance as measured by the mean squared error and mean variance. 
All methods with shrinkage priors on Qc improve the prediction over the standard 
method using the Wishart prior. Among the shrinkage methods, the logarithmic prior 
outperforms the G-Wishart prior. Allowing Qr to be adaptive by setting r,. = 1 and 
10 can further reduce the mean squared error while maintaining the same predictive 
variance with the common p model. Overall, our results suggest that the models (13) 
and (14) provide more accurate prediction and narrower credible intervals than the 
competing methods for this dataset. 

To further study the prior sensitivity to the choice of r^, we plotted the marginal 
prior and posterior densities for the free off-diagonal element in Qr using samples from 
the analysis of the first cross-validation dataset. Figure 2 displays the inference for one 
element under r,. G {0-1, 1, 10}. Clearly, the marginal posterior distribution depends 
on the choice of r^. This is not surprising because the sample size is small compared 
to the dimension of ilr- The 1 and 10 seems to perform well in this example 

because the marginal posterior distribution is influenced by the data. The case Xp =0-1 
appears to be too tight and thus is not largely influenced by the data. 

On the computing time, the Matlab implementation of model (14) took about 4 
hours to complete the analysis of one of the ten cross-validation datasets, while model 
(12) of Dobra et al. (2011) took about 4 days. Additionally, Dobra et al. (2011) reported 
a runtime of about 22 hours on a dual-core 2-8 Ghz computer under C-f-|- implemen- 
tation for a similar dataset of size pr = 49 and pc = 11. As mentioned above, our 
models based on the scale mixture of uniforms are not only more flexible but also more 
computationally efficient. 



In this section we briefly investigate the properties of the shrinkage prior constructed 
from scale mixture of uniforms for the linear regression models. Recently, shrinkage 
estimation for linear models have received a lot of attention (Park & Casella, 2008; 
Griffin & Brown, 2010; Armagan et al., 2011) all of which proceed via the scale mixture 
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(a) Raw mortality rate 



(b) Predicted mortality rate 




0.5 1 1.5 2 2.5 0.5 1 1.5 2 2.5 



Figure 1: US cancer mortality map of colon cancer (per 10000 habitants), (a) The raw 
mortality rate, (b) The predicted mortality rate under model TDE+Log with = 1. 



Table 3: Predictive mean squared error and variance in 10-fold cross-validation predic- 
tive performance in the cancer mortality example. 

GV DLR Common p+Log TDE+Log 

MSE 3126 2728 2340 2238 2187 2359 

VAR 9177 6493 3814 3850 3810 3694 

GV: the non-shrinkage model (11) of Gelfand & Vounatsou (2003); DLR: model (12) of Dobra 
et al. (2011); Common p+Log: model (13) under common p for fir and logarithmic prior for fi^; 
TDE+Log: model (14) under truncated double-exponential prior for fl^. with fixed but different 
and logarithmic prior for i7c- 



(a) Tr =0-1 (b) Tr =1 (c) T,. =10 




a . ' ' ' ' ' a . ' (0 



Figure 2: Marginal prior (dashed lines) and posterior (solid lines) densities of one free 
off-diagonal element in Qj. from the analysis under model (14) with three different values 

of Tr'. (a) Tr =0-1, (b) Tr =1, (c) Tr =10. 
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of normals. Walker et al. (1997) and Qin et al. (1998) were among the first to use 
the scale mixture of uniform priors for regression models. However, they used this 
family only for modeling the measurement errors and deriving the corresponding Gibbs 
sampler. To the best of our knowledge, we are the first to investigate the scale mixture 
of uniforms as a class of shrinkage priors for regression coefficients. When this paper was 
nearing completion we were notified of a similar approach in the very recent work Poison 
& Scott (2011) in which the authors independently propose a similar construction based 
on mixtures of Bartlett-Fejer kernels for the bridge regression and proceed via a result 
similar to Theorem 1. 

Consider the following version of a regularized Bayesian linear model where the goal 
is to sample from the posterior distribution 

p(/3 I a, r, Y) oc exp{--^(F - X/3)-(r - X/3)} n^(^) 

where g{-) is the shrinkage prior and r is the global shrinkage parameter. Theorem 1 
suggests we can introduce latent variable t = {ti, . . . ,tp} such that the joint posterior 
of (/3, t) is given by: 

1 ^ 
p{P,t I a,r,r) ocexp{- — (r-X/3)-(r-X/3)}n{-^7'(i.)lw>|/3,|}} 

^ i=i 

The Gibbs samplers are then implemented by (a) simulating (]j from a truncated normal 
for each j, and (b) block simulating {ti, . . . ,tp} from using the conditional cumulative 
distribution function in Theorem 2. 

We compare the posterior mean estimators under the exponential power prior with 
q =0-2 and the logarithmic prior to the posterior means corresponding to several other 
existing priors. These two shrinkage priors are interesting because the exponential 
power prior is the Bayesian analog of the bridge regression (Park & Casella, 2008) 
and is challenging for fully posterior analysis using the scale mixture of normals and 
relatively unexplored before, and the logarithmic prior is a new prior that resembles the 
class of horseshoe priors that are shown to have some advantages over many existing 
approaches (Carvalho et al., 2010). 

We use the setting of simulation experiments considered in Armagan et al. (2011). 
Specifically, we generate n = 50 observations from y = x'^f3+e, e ~ N(0, 3^), where (3 has 
one of the following five configurations: (i) /3 =(1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)^, 
(ii) /3(3,3,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0f , (iii) /3 =(1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0)^, 
(iv) (3 =(3,3,3,3,3,0,0,0,0,0,3,3,3,3,3,0,0,0,0,0)^, (v) /3 =(0-85, . . ., 0-85)^, and x = 
{xi, . . . , XpY has one of the following two scenarios: (a) Xj are independently and iden- 
tically distributed standard normals, (b) x is a multivariate normal with E{x) = and 
COv{Xj, Xji ) = 0-5l^-^'l. The variance is assumed to have the Jeffrey's prior p{a'^) oc 1/cr^. 
The global shrinkage parameter is assumed to have the conjugate t~'^ ~ Ga(l,l) for 
the exponential power prior with q =0-2, and r ~ C^(0, 1) for the logarithmic prior. 
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Table 4: Summary of model errors for the simulation study in the regression analy- 
sis of Section 6. Median model errors are reported; bootstrap standard errors are in 
parentheses. 
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Model error is calculated using the Mahalanobis distance (/3 — — /?) where Sx 

is the covariance matrix used to generate X. 

Table 4 reports the median model errors and the bootstrap standard errors based 
on 100 datasets for each case. Results for cases other than the exponential power prior 
with q =0-2 and the logarithmic prior are based on the reported values of Armagan 
et al. (2011). Except for model (iii) and (v) in the correlated predictor scenario, the ex- 
ponential power prior with g = 1 is outperformed by other methods. The performances 
of the exponential power prior with q =0-2 and the logarithmic prior are comparable 
with those of the generalized Pareto and the horseshoe priors. 



7 Conclusion 



The scale mixture of uniform prior provides a unified framework for shrinkage estima- 
tion of covariance matrices for a wide class of prior distributions. Further research on 
the scale mixture of uniform distributions is of interest in developing theoretical insights 
as well as computational advances in shrinkage prior estimation for Bayesian analysis of 
covariance matrices and other related models. One obvious next step is to investigate 
the covariance selection models that encourage exact zeros on a subset of elements of 
Q under the scale mixture uniform priors. Such extensions can potentially combine the 
flexibility of the scale mixture of uniform priors and the interpretation of the graphs 
implied by exact zero elements. Another interesting research direction is the generaliza- 
tion of the basic random sampling models to dynamic settings that allow the covariance 
structure to be time-varying. Such models are useful for analyzing high-dimensional 
time series data encountered in areas such as finance and environmental sciences. We 
are current investigating these extensions and we expect the Gibbs sampler developed 
in Section 3.1 to play a key role in model fitting in these settings. 
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Appendix 

Details of sampling algorithm in Section 3.1 
The joint distribution of (/12, di, (^2) is: 

p{di,d2,l2i I -) oc (i"^^^^ci2''^exp[-^tr{sii(ii + S22{liidi + ^2) + 2s2idil2i}] l{f7,,,er}- 
Clearly, the full conditional distribution for di, ^2 and I21 are given by 

di ~ GA{n/2 + 2, (sii + S22II1 + 2s2i/2i)/2} l{n,,,er} , 

d2 ~ GA(n/2 + l,S22/2) IfHe.eer} and ki ~ N{s2i/s22, l/(s22C?i)} l{n,,,er}, respectively. 
To identify the truncated region T, recall 

n,, = A + B, A=( f} Y i?=fJ"M- 

V ^1^21 C^l/il +d2 J ' \b21 &22 J 

The set T = < ty} fl < tjj} fl < tjj} can be written as 

{\di + bn\< tii} n {lc/1/21 + 621I < ty} n + ^2 + < tjj}. (15) 

Given {B,tii,tij,tjj}, (15) gives straightforward expressions for the truncated region of 
each variable in [di, d2, hi) conditional on the other two. 

Sampling a univariate truncated normal can be carried out efficiently using the 
method of Robert (1995), while sampling a truncated gamma is based on the inverse 
cumulative distribution function method. 
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